Abstract

Existing technology can parse arbitrary context-free grammars, but
only a single, static grammar per input. In order to support more
powerful syntax-extension systems, we propose reflective grammars, which
can modify their own syntax during parsing. We demonstrate and prove
the correctness of an algorithm for parsing reflective grammars. The
algorithm is based on Earley's algorithm, and we prove that it performs
asymptotically no worse than Earley's algorithm on ordinary context-free
grammars.



1 Introduction 

A software project may involve many different languages with different pur- 
poses and complexities, each with its own "natural" syntax. Typically, these 
languages are segregated from each other, either appearing in separate files, 
or inside strings. But parenthesis-structured languages from the Lisp family 
support incremental syntax extension (via macro systems). This extension
process provides powerful integration, but the surface syntax is restricted to
S-expressions.

We believe it is possible to bridge this gap and create macro systems with the
syntactic power of arbitrary context-free grammars. However, new parsing tech- 
nology is needed to do so. In this paper, we propose reflective grammars, which 
allow a language designer to define an incrementally extensible base language. 



*This research was made possible by the US National Science Foundation under
grant number CCF-0811015, "CPA-SEL: Developing a Theory of Hygienic Macros".
†A shorter version of this paper appeared in LDTA 2011 [20].

In such a language, a valid sentence may contain strings matching productions
dynamically added by the sentence itself. This happens in a structured fashion.
Users of this language can use its extension construct to write in any surface
syntax they want.

These language extensions are dynamic in the sense that they occur in the
same file in which they are used; they are structured in that they have
well-defined scope; and they are recursive in that an arbitrary number of
extensions may be nested.

Our reflective grammars are based on context-free grammars. Although
many modern languages can be made to fit into restricted subsets of
context-free languages, such as LALR(1), context-free languages are easier
to understand and manipulate, and are closed under composition [12]. This
means that they are more suitable for languages which are to be extended by
the user.

Others have demonstrated impressive speed improvements to the Earley and 
GLR algorithms [2,3, 14-16]. We believe that the historical performance moti- 
vations for using restricted subsets of context-free grammars no longer apply. 

A macro system could provide meaning to these syntactic extensions, but 
we do not present one here; this paper only covers parsing. 

In section 2, we describe reflective languages in more detail. Section 3
describes a recognition and parsing algorithm. Section 4 proves an upper bound
on the time taken by parsing. Section 5 covers related work, and section 6
discusses our conclusion and future work.

2 Reflective languages 
Examples 

The crux of our examples is the special right-hand side symbol $\mathbb{R}$.
In the grammar $G$, the strings $w$ that $\mathbb{R}$ derives (denoted
$G \vdash \mathbb{R} \Rightarrow w$) are the strings in the set

$$\{\, w_1 w_2 : G \vdash \langle\mathit{Gram}\rangle \Rightarrow w_1 \text{ and } G' \vdash S' \Rightarrow w_2 \,\},$$

where

• ⟨Gram⟩ is a distinguished nonterminal in $G$ such that strings derivable
from ⟨Gram⟩ can be interpreted as grammars by an operation denoted
$\llbracket - \rrbracket$,

• $G' = G \oplus \llbracket w_1 \rrbracket$, where $\oplus$ creates a new
grammar by combining the productions of two grammars, and

• $S'$ is the start symbol of $G'$.






For our examples, we will define a reflective grammar for a language
containing numbers, identifiers, and function invocations in the style of
C-like languages. In addition to these conventional elements, the grammar
accepts extensions, marked by pairs of curly brackets. The meaning of the
extension symbol $\mathbb{R}$ depends on the nonterminal ⟨Gram⟩, which we
also must define, giving a BNF-like meta-syntax for reflective grammars.
$\mathbb{R}$ is represented in this notation as REFL. The start nonterminal
of the resulting grammar is specified immediately after gram.

We assume that the nonterminals ⟨Identifier⟩, ⟨Nonterm⟩, ⟨QuotedString⟩,
and ⟨NaturalNumber⟩ have been given appropriate definitions already. We also
assume that whitespace is ignored, except that ⟨Nonterm⟩ and ⟨Identifier⟩
follow standard tokenization rules. Our parser implementation successfully
processes all the examples we give.

A simple sentence in the language of this grammar is plus(1, plus(2,3)).
A sentence that uses its reflective capabilities to add simple infix
operations is

  plus(1, plus(2,
    {{ gram <Expr>
         <Expr> ::= <SimpleExpr> <Op> <Expr> ;
         <Op> ::= "+" ;
       end_gram
       3 + plus(4, 5 + 6) }} ), 7)

The extension recognizes the text between gram and end_gram inclusive as
being derived from ⟨Gram⟩. It interprets the grammar extension, and after that,
it expects a string derived from ⟨Expr⟩ in the extended grammar, which it finds:
3 + plus(4, 5 + 6). The surrounding text, that is, plus(1, plus(2, {{ and
}} ), 7), is in the original grammar. This means that the sentence






  plus(1, plus(2,
    {{ gram <Expr>
         <Expr> ::= <SimpleExpr> <Op> <Expr> ;
         <Op> ::= "+" ;
       end_gram
       3 + plus(4, 5 + 6) }} ), 7 + 8)

is not in the language of the grammar, because 7 + 8 is outside the
$\mathbb{R}$ that provided a new definition for ⟨Expr⟩.

Extensions can be used to gradually build up more powerful languages. In
the following example, still in the same base grammar, we add lambda
expressions and then infix operations (we represent λ as \, making the
assumption that backslash is not already used as the escape character in
string literals):

  plus(1,
    {{ gram <Expr>
         <Expr> ::= "\" <Identifier> "." <Expr> ;
         <SimpleExpr> ::= "(" <Expr> ")" ;
       end_gram
       (\x. plus(2,x))(
         plus(3,
           {{ gram <Expr>
                <Expr> ::= <SimpleExpr> <Op> <Expr> ;
                <Op> ::= "+" ;
              end_gram
              (\y. 4 + y)(
                5 + (\z. 6 + z)(7)) }} )) }} )

Note that the extension markers that this base grammar uses, {{ }}, have
no special status in our system, and the user could choose to use them as another
kind of delimiter, provided he or she did so unambiguously. The only reason
they appear in the base grammar at all is that omitting them would have
made extensions hard to read, and would even have made it ambiguous where a
grammar extension ends after binary operations are permitted.

However, suppose that the author of the base language lacked this
foresight, and had written the extension rule as ⟨SimpleExpr⟩ → $\mathbb{R}$,
instead of ⟨SimpleExpr⟩ → {{ $\mathbb{R}$ }}. All would not be lost, because
the user could have simply added and then used a new, better construct using
REFL, which represents the $\mathbb{R}$ construct in our meta-syntax:

  plus(1, gram <Expr>
            <Expr> ::= "{{" REFL "}}" ;
          end_gram
          {{ gram <Expr>
               <Expr> ::= <SimpleExpr> <Op> <Expr> ;
               <Op> ::= "+" ;
             end_gram
             2 + 3 }} )

The old and now ambiguous extension syntax still remains, however. This 
is because, for simplicity, we have omitted from these examples the ability to 
remove productions from grammars. It would be very easy to add this, however. 
Our formalism does not depend on any relationship between the grammar being 
extended and the extension, but to obtain the complexity bounds of section 4, 
it must be possible to compute the extension quickly. 

Definitions 

To define reflective grammars, we first need some metavariables. Let $t$ range
over terminal symbols, $A$ and $B$ be nonterminals, $\alpha$, $\beta$,
$\gamma$, and $\delta$ be right-hand sides (strings of terminals,
nonterminals, and the distinguished symbol $\mathbb{R}$), $x$ be the input
string of terminals, and let $i$, $j$, $k$, and $l$ be indices into that
string. We will use $x_{i,j}$ to represent substrings of $x$. The indices are
zero-based and half-open; i.e., $x = x_{0,|x|}$. The empty string will be
represented with the symbol $\epsilon$. We will name other strings $w$.
Finally, we will use $G$ for a reflective grammar.
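
The zero-based, half-open indexing convention coincides with slice notation in many programming languages. As a small illustration (the helper name `substring` is ours, not part of the formalism), the identities used later in the proofs can be checked directly:

```python
def substring(x, i, j):
    """Return x_{i,j}: the characters of x at positions i, i+1, ..., j-1."""
    return x[i:j]

x = "plus(1,2)"
whole = substring(x, 0, len(x))                       # x_{0,|x|}
empty = substring(x, 3, 3)                            # x_{i,i}
joined = substring(x, 0, 4) + substring(x, 4, 9)      # x_{i,j} followed by x_{j,k}
```

Here `whole` and `joined` both equal `x`, and `empty` is the empty string, mirroring $x = x_{0,|x|}$, $x_{i,i} = \epsilon$, and $x_{i,j}\,x_{j,k} = x_{i,k}$.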

A reflective grammar $G$ consists of some set of productions
$(A \to \alpha) \in G$, and a start symbol $A = G.\mathit{start}$.
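
One way to picture this definition concretely is as a record holding a start symbol and a set of productions. The sketch below is only illustrative: the names `Grammar` and `combine` are ours, and the choice to take the start symbol from the extension is an assumption based on the examples (where the start nonterminal is given after `gram`); the formalism itself only requires that $\oplus$ combine the productions of two grammars.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Grammar:
    start: str               # G.start
    productions: frozenset   # pairs (A, alpha), with alpha a tuple of symbols

def combine(g, extension):
    """One possible reading of the combining operation: union the productions
    and use the start symbol named by the extension (an assumption)."""
    return Grammar(extension.start, g.productions | extension.productions)

base = Grammar("Expr", frozenset({("Expr", ("SimpleExpr",))}))
ext = Grammar("Expr", frozenset({
    ("Expr", ("SimpleExpr", "Op", "Expr")),
    ("Op", ("+",)),
}))
extended = combine(base, ext)
```

Because nothing is removed, every production of `base` is still present in `extended`, which matches the remark above that the old extension syntax remains available.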

Semantics 

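In order to define the meaning of a reflective grammar, we must define the
meaning of right-hand sides. We write $G \vdash \alpha \Rightarrow x$ to mean
that the right-hand side $\alpha$ derives the string $x$ according to the
grammar $G$. Right-hand sides are built recursively from terminals,
nonterminals, and the $\mathbb{R}$ symbol:

L-Empty
$$\frac{}{G \vdash \epsilon \Rightarrow \epsilon}$$

L-Terminal
$$\frac{G \vdash \alpha \Rightarrow w}{G \vdash \alpha t \Rightarrow w t}$$

L-Nonterminal
$$\frac{G \vdash \alpha \Rightarrow w_1 \qquad (A \to \delta) \in G \qquad G \vdash \delta \Rightarrow w_2}{G \vdash \alpha A \Rightarrow w_1 w_2}$$

L-Reflection
$$\frac{G \vdash \alpha \Rightarrow w_1 \qquad G \vdash \langle\mathit{Gram}\rangle \Rightarrow w_2 \qquad G' = G \oplus \llbracket w_2 \rrbracket \qquad (G'.\mathit{start} \to \delta) \in G' \qquad G' \vdash \delta \Rightarrow w_3}{G \vdash \alpha \mathbb{R} \Rightarrow w_1 w_2 w_3}$$

We say $x \in L(G)$ (that is, $x$ is in the language of $G$), iff
$G \vdash G.\mathit{start} \Rightarrow x$.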

We restrict $\oplus$ by forbidding the user from extending the special ⟨Gram⟩
nonterminal, and the nonterminals that make it up, because the interpretation
function $\llbracket - \rrbracket$ is fixed, so it would not be able to
interpret the newly-valid strings that ⟨Gram⟩ derives. However, a macro system
using this parser could reasonably permit extensions to ⟨Gram⟩ if the user
supplied a translation from the extended notation for grammars into the
original notation. Also, to make our complexity analysis simpler, we require
that ⟨Gram⟩ be non-nullable and appear on the left-hand side of only one
production.

3 Recognizer algorithm 

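We next present an algorithm for recognizing the language of a reflective
grammar $G$, based on the Earley recognizer algorithm [8]:

R-Start
$$\frac{(G.\mathit{start} \to \delta) \in G}{(0, G.\mathit{start} \to \cdot\,\delta, G) \in S_0}$$

R-Shift
$$\frac{(i, A \to \alpha \cdot t\beta, G) \in S_j \qquad x_j = t}{(i, A \to \alpha t \cdot \beta, G) \in S_{j+1}}$$

R-Call
$$\frac{(i, A \to \alpha \cdot B\beta, G) \in S_j \qquad (B \to \delta) \in G}{(j, B \to \cdot\,\delta, G) \in S_j}$$

R-Return
$$\frac{(i, A \to \alpha \cdot B\beta, G) \in S_j \qquad (j, B \to \delta \cdot, G) \in S_k}{(i, A \to \alpha B \cdot \beta, G) \in S_k}$$

R-Parse-grammar
$$\frac{(i, A \to \alpha \cdot \mathbb{R}\beta, G) \in S_j \qquad (\langle\mathit{Gram}\rangle \to \gamma) \in G}{(j, \langle\mathit{Gram}\rangle \to \cdot\,\gamma, G) \in S_j}$$

R-Refl-call
$$\frac{(i, A \to \alpha \cdot \mathbb{R}\beta, G) \in S_j \qquad (j, \langle\mathit{Gram}\rangle \to \gamma \cdot, G) \in S_k \qquad G' = G \oplus \llbracket x_{j,k} \rrbracket \qquad (G'.\mathit{start} \to \delta) \in G'}{(k, G'.\mathit{start} \to \cdot\,\delta, G') \in S_k}$$

R-Refl-return
$$\frac{(i, A \to \alpha \cdot \mathbb{R}\beta, G) \in S_j \qquad G' = G \oplus \llbracket x_{j,k} \rrbracket \qquad (k, G'.\mathit{start} \to \delta \cdot, G') \in S_l}{(i, A \to \alpha \mathbb{R} \cdot \beta, G) \in S_l}$$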

An Earley recognizer accumulates Earley items. An Earley item is a tuple
$(i, A \to \alpha \cdot \beta, G)$, where $(A \to \alpha\beta) \in G$, and the
cursor (the $\cdot$ symbol) marks a position in the right-hand side
$\alpha\beta$. The grammar $G$ is not part of traditional Earley items; we
have added it for our grammars. The algorithm collects sets $S_j$, where the
set $S_j$ corresponds to the $j$th character in the input string $x$. The
algorithm places the Earley item $(i, A \to \alpha \cdot \beta, G)$ in the set
$S_j$ only if $G \vdash \alpha \Rightarrow x_{i,j}$. However, for efficiency's
sake, the recognizer only generates that Earley item in the first place if it
might be needed (the R-Call rule determines that a nonterminal might need to
be recognized at a particular point).

The recognizer proceeds strictly left-to-right. The rules R-Start and
R-Call place items of the form $(j, A \to \cdot\,\delta, G)$ in locations
where the nonterminal $A$ is expected, to "seed" recognition of an $A$. The
R-Shift rule advances the cursor over an expected terminal. The R-Return rule
advances the cursor over an expected nonterminal, provided there exists a
corresponding "finished" item of the form $(j, A \to \delta \cdot, G)$.

The last three rules, R-Parse-grammar, R-Refl-call, and R-Refl-return,
are our additions to the algorithm. R-Parse-grammar and R-Refl-call
are both "seed" rules, analogous to R-Call. R-Parse-grammar fires
when the recognizer reaches an $\mathbb{R}$, and it starts to consume a string
matching ⟨Gram⟩. When the ⟨Gram⟩ has been completely parsed, R-Refl-call
creates an extended grammar, and descends into its start nonterminal. Finally,
R-Refl-return is analogous to R-Return; it is triggered by an Earley item that
indicates that a string matching the extended grammar is completed, and it
advances the cursor over the $\mathbb{R}$ that was waiting on it.

If $G' = G \oplus \llbracket x_{j,k} \rrbracket$, then we will say that
$G'.\mathit{location} = (j, k)$ and $G'.\mathit{parent} = G$ (note that $G$
could be an extended grammar or just the base grammar). We will compare
grammars in an intensional fashion. Two extended grammars will be equal
exactly when their locations and parents are the same, which implies that, in
fact, they possess exactly the same rules. This will decrease the complexity
of executing the R-Refl-return rule, and make equality comparisons between
Earley items fast.

The algorithm is considered to have recognized the string $x$ in the language
of $G$ iff it produces an Earley item of the form
$(0, G.\mathit{start} \to \delta \cdot, G)$ in the last set, $S_{|x|}$.
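
The seven rules can be sketched as a small worklist recognizer. This is a minimal illustration under stated assumptions, not our implementation: terminals are single characters and nonterminal names have at least two characters; `interpret` stands in for $\llbracket - \rrbracket$ and returns a start symbol and a list of new productions; $\oplus$ is taken to append productions; and the toy meta-syntax (`!b;` adds the production Item → b) is invented for the example. Extended grammars are created once per parent and location and compared by identity, following the intensional comparison described above.

```python
REFL, GRAM = "REFL", "Gram"   # reflection symbol and distinguished nonterminal

class Grammar:
    """Productions plus a start symbol; compared by identity (intensionally)."""
    def __init__(self, start, prods, parent=None, location=None):
        self.start, self.prods = start, tuple(prods)
        self.parent, self.location = parent, location

def recognize(base, interpret, x):
    n = len(x)
    S = [set() for _ in range(n + 1)]        # S[j] holds items (i, A, rhs, dot, G)
    extended = {}                            # (id(parent), j, k) -> extended grammar

    def extend(G, j, k):                     # G' = G (+) [[x[j:k]]], built once per location
        if (id(G), j, k) not in extended:
            start, new = interpret(x[j:k])
            extended[id(G), j, k] = Grammar(start, G.prods + tuple(new), G, (j, k))
        return extended[id(G), j, k]

    def waiting(j, sym, G):                  # items of G in S[j] whose cursor precedes sym
        return [it for it in list(S[j])
                if it[4] is G and it[3] < len(it[2]) and it[2][it[3]] == sym]

    for A, rhs in base.prods:                # R-Start
        if A == base.start:
            S[0].add((0, A, rhs, 0, base))

    for j in range(n + 1):
        agenda = list(S[j])

        def add(k, item):                    # set insertion with deduplication
            if item not in S[k]:
                S[k].add(item)
                if k == j:
                    agenda.append(item)

        while agenda:
            i, A, rhs, d, G = agenda.pop()
            if d < len(rhs):
                sym = rhs[d]
                if sym == REFL:              # R-Parse-grammar
                    for B, gamma in G.prods:
                        if B == GRAM:
                            add(j, (j, B, gamma, 0, G))
                elif len(sym) == 1:          # R-Shift
                    if j < n and x[j] == sym:
                        add(j + 1, (i, A, rhs, d + 1, G))
                else:                        # R-Call
                    for B, delta in G.prods:
                        if B == sym:
                            add(j, (j, B, delta, 0, G))
                    # R-Return when an empty B was already finished in S[j]
                    if any(it[0] == j and it[1] == sym and it[3] == len(it[2])
                           and it[4] is G for it in S[j]):
                        add(j, (i, A, rhs, d + 1, G))
            else:
                for i2, A2, rhs2, d2, _ in waiting(i, A, G):          # R-Return
                    add(j, (i2, A2, rhs2, d2 + 1, G))
                if A == GRAM and waiting(i, REFL, G):                 # R-Refl-call
                    G2 = extend(G, i, j)
                    for B, delta in G2.prods:
                        if B == G2.start:
                            add(j, (j, B, delta, 0, G2))
                if G.location and A == G.start and i == G.location[1]:  # R-Refl-return
                    for i2, A2, rhs2, d2, P in waiting(G.location[0], REFL, G.parent):
                        add(j, (i2, A2, rhs2, d2 + 1, P))

    return any(it[0] == 0 and it[1] == base.start and it[3] == len(it[2])
               and it[4] is base for it in S[n])

def interpret(text):          # hypothetical [[-]]: "!b;" adds the production Item -> b
    return "Start", [("Item", (text[1],))]

base = Grammar("Start", [
    ("Start", ("Item", "Start")), ("Start", ("Item",)),
    ("Item", ("a",)), ("Item", ("[", REFL, "]")),
    (GRAM, ("!", "Ch", ";")), ("Ch", ("b",)), ("Ch", ("c",)),
])
```

With this toy base grammar, `recognize(base, interpret, "a[!b;ab]a")` accepts, because `b` is used only inside the extension that defines it, while `recognize(base, interpret, "a[!b;ab]b")` rejects, because the final `b` lies outside that scope.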

Parsing instead of recognizing 

There are two approaches to turning the recognizer into a parser. If ambiguous
parses are to be rejected by the parser, Earley's simple technique suffices: in
each Earley item, we associate each nonterminal to the left of the cursor with
a pointer to the "completed" Earley item $(j, B \to \delta \cdot, G)$ that
derives it. Items that have multiple pointers render any parse that uses them
ambiguous.

If a representation of all parses is desired, Scott's Buildtree algorithm [19]
can be adapted easily to our recognizer. It depends on the recognizer
annotating nodes with "predecessor" and "reduction" pointers. Therefore, when a
rule produces an item $(i, A \to \alpha\beta \cdot \gamma, G) \in S_k$, where
$\beta$ is a single terminal, nonterminal, or $\mathbb{R}$, it adds a
predecessor pointer from it to the antecedent item
$(i, A \to \alpha \cdot \beta\gamma, G) \in S_j$. When R-Return produces
$(i, A \to \alpha B \cdot \gamma, G) \in S_k$, it adds a reduction pointer
from it to the antecedent item $(j, B \to \delta \cdot, G) \in S_k$, and when
R-Refl-return produces an item of the form
$(i, A \to \alpha \mathbb{R} \cdot \gamma, G) \in S_k$, it adds a reduction
pointer from it to the antecedent item $(j, B \to \delta \cdot, G') \in S_k$.

Scott's algorithm traverses the Earley items and builds up a shared packed
parse forest. The symbol nodes [19, p. 59] are marked with a nonterminal and
a beginning and ending position. In a reflective setting, these nodes must also
have the grammar from which the nonterminal came, because a nonterminal is
only meaningful in the context of some grammar.

Correctness

Before we prove correctness, we present a slight reformulation of our semantics,
where concatenation is represented indirectly, by taking substrings of the input:

L-Empty
$$\frac{}{G \vdash \epsilon \Rightarrow x_{i,i}}$$

L-Terminal
$$\frac{G \vdash \alpha \Rightarrow x_{i,j} \qquad x_j = t}{G \vdash \alpha t \Rightarrow x_{i,j+1}}$$

L-Nonterminal
$$\frac{G \vdash \alpha \Rightarrow x_{i,j} \qquad (A \to \delta) \in G \qquad G \vdash \delta \Rightarrow x_{j,k}}{G \vdash \alpha A \Rightarrow x_{i,k}}$$

L-Reflection
$$\frac{G \vdash \alpha \Rightarrow x_{i,j} \qquad G \vdash \langle\mathit{Gram}\rangle \Rightarrow x_{j,k} \qquad G' = G \oplus \llbracket x_{j,k} \rrbracket \qquad (G'.\mathit{start} \to \delta) \in G' \qquad G' \vdash \delta \Rightarrow x_{k,l}}{G \vdash \alpha \mathbb{R} \Rightarrow x_{i,l}}$$



Proving the algorithm correct consists of two parts: that the algorithm rec- 
ognizes all strings in the language of the grammar ( "valid strings" ) and that it 
recognizes none that are not ("invalid strings"). 

Completeness Lemma (the algorithm recognizes all valid strings).

$G \vdash G.\mathit{start} \Rightarrow x$ implies
$(0, G.\mathit{start} \to \delta \cdot, G) \in S_{|x|}$

Proof. We will first show that, given an input string $x$ in the language of $G$,

$G \vdash \alpha \Rightarrow x_{i,j}$ and
$(i, A \to \cdot\,\alpha\beta, G) \in S_i$ implies
$(i, A \to \alpha \cdot \beta, G) \in S_j$

We will proceed by induction on the structure of the proof tree that
$G \vdash \alpha \Rightarrow x_{i,j}$. Each case corresponds to a rule for
generating right-hand sides that recognize a string.

L-Empty: $\alpha = \epsilon$. $G \vdash \epsilon \Rightarrow x_{i,j}$ implies
that $x_{i,j} = \epsilon$. This means that $i = j$. Therefore, the item
$(i, A \to \cdot\,\alpha\beta, G)$ is the same as the item
$(i, A \to \alpha \cdot \beta, G)$, and it is already in $S_j$.

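L-Terminal: $G \vdash \alpha \Rightarrow x_{i,j}$ and $x_j = t$.

  $(i, A \to \alpha \cdot t\beta, G) \in S_j$, by the induction hypothesis at
  $G \vdash \alpha \Rightarrow x_{i,j}$.
  $(i, A \to \alpha t \cdot \beta, G) \in S_{j+1}$, by R-Shift.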

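L-Nonterminal: $G \vdash \alpha \Rightarrow x_{i,j}$ and, for some $\delta$,
$(B \to \delta) \in G$ and $G \vdash \delta \Rightarrow x_{j,k}$.

  $(i, A \to \alpha \cdot B\beta, G) \in S_j$, by the induction hypothesis at
  $G \vdash \alpha \Rightarrow x_{i,j}$.
  $(j, B \to \cdot\,\delta, G) \in S_j$, by R-Call.
  $(j, B \to \delta \cdot, G) \in S_k$, by the induction hypothesis at
  $G \vdash \delta \Rightarrow x_{j,k}$.
  $(i, A \to \alpha B \cdot \beta, G) \in S_k$, by R-Return.

L-Reflection: $G \vdash \alpha \Rightarrow x_{i,j}$ and
$G \vdash \langle\mathit{Gram}\rangle \Rightarrow x_{j,k}$ and
$G' = G \oplus \llbracket x_{j,k} \rrbracket$ and
$G' \vdash G'.\mathit{start} \Rightarrow x_{k,l}$.

  $(i, A \to \alpha \cdot \mathbb{R}\beta, G) \in S_j$, by the induction
  hypothesis at $G \vdash \alpha \Rightarrow x_{i,j}$.
  $(j, \langle\mathit{Gram}\rangle \to \cdot\,\gamma, G) \in S_j$, by
  R-Parse-grammar, for all $(\langle\mathit{Gram}\rangle \to \gamma) \in G$.
  There is a $\gamma$ such that $(\langle\mathit{Gram}\rangle \to \gamma) \in G$
  and $G \vdash \gamma \Rightarrow x_{j,k}$, by inversion of L-Nonterminal.
  $(j, \langle\mathit{Gram}\rangle \to \gamma \cdot, G) \in S_k$, by the
  induction hypothesis at $G \vdash \gamma \Rightarrow x_{j,k}$ (which is
  higher up in the proof tree, so the induction hypothesis may be applied).
  Let $G'$ be $G \oplus \llbracket x_{j,k} \rrbracket$.
  $(k, G'.\mathit{start} \to \cdot\,\delta, G') \in S_k$, by R-Refl-call.
  There is a $\delta$ such that $(G'.\mathit{start} \to \delta) \in G'$ and
  $G' \vdash \delta \Rightarrow x_{k,l}$, by inversion of L-Nonterminal.
  $(k, G'.\mathit{start} \to \delta \cdot, G') \in S_l$, by the induction
  hypothesis at $G' \vdash \delta \Rightarrow x_{k,l}$.
  $(i, A \to \alpha \mathbb{R} \cdot \beta, G) \in S_l$, by R-Refl-return.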



By the premise, $x_{0,|x|}$ (that is, $x$) is in the language of $G$, so, for
some $\delta$ with $(G.\mathit{start} \to \delta) \in G$,
$G \vdash \delta \Rightarrow x$. By the R-Start rule,
$(0, G.\mathit{start} \to \cdot\,\delta, G) \in S_0$. By the above
argument, we also know that
$(0, G.\mathit{start} \to \delta \cdot, G) \in S_{|x|}$, which is to say that
the algorithm has successfully recognized the string.

□ 



Grammar Origin Lemma (all extended grammars come from a parsed ⟨Gram⟩).

For any extended grammar $G' = G \oplus \llbracket x_{j,k} \rrbracket$ that
appears in an Earley item, there exists some Earley item
$(j, \langle\mathit{Gram}\rangle \to \gamma \cdot, G) \in S_k$.

Proof. By induction on the recognizer rules; only R-Refl-call creates
new grammars, and it obeys the above condition. □

Soundness Lemma (the algorithm recognizes no invalid strings).

$(0, G.\mathit{start} \to \delta \cdot, G) \in S_{|x|}$ implies
$G \vdash G.\mathit{start} \Rightarrow x$

Proof. We will first show that, given a string $x$ that our algorithm recognizes
as being in the language of $G$,

$(i, A \to \alpha \cdot \beta, G) \in S_j$ implies
$G \vdash \alpha \Rightarrow x_{i,j}$

We will proceed by induction on the structure of the proof tree that
$(i, A \to \alpha \cdot \beta, G) \in S_j$.

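R-Shift: $(i, A \to \alpha \cdot t\beta, G) \in S_{j-1}$ and $x_{j-1} = t$.

  $G \vdash \alpha \Rightarrow x_{i,j-1}$, by the induction hypothesis at
  $(i, A \to \alpha \cdot t\beta, G) \in S_{j-1}$.
  $G \vdash \alpha t \Rightarrow x_{i,j}$, by L-Terminal.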

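R-Return: $(i, A \to \alpha \cdot B\beta, G) \in S_j$ and
$(j, B \to \delta \cdot, G) \in S_k$.

  $G \vdash \alpha \Rightarrow x_{i,j}$, by the induction hypothesis at
  $(i, A \to \alpha \cdot B\beta, G) \in S_j$.
  $G \vdash \delta \Rightarrow x_{j,k}$, by the induction hypothesis at
  $(j, B \to \delta \cdot, G) \in S_k$.
  $(B \to \delta) \in G$, by the definition of Earley items.
  $G \vdash \alpha B \Rightarrow x_{i,k}$, by L-Nonterminal.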

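R-Refl-return: $(i, A \to \alpha \cdot \mathbb{R}\beta, G) \in S_j$ and
$G' = G \oplus \llbracket x_{j,k} \rrbracket$ and
$(k, G'.\mathit{start} \to \delta \cdot, G') \in S_l$.

  $G \vdash \alpha \Rightarrow x_{i,j}$, by the induction hypothesis at
  $(i, A \to \alpha \cdot \mathbb{R}\beta, G) \in S_j$.
  $(j, \langle\mathit{Gram}\rangle \to \gamma \cdot, G) \in S_k$, by the
  Grammar Origin Lemma.
  $G \vdash \langle\mathit{Gram}\rangle \Rightarrow x_{j,k}$, by the induction
  hypothesis at $(j, \langle\mathit{Gram}\rangle \to \gamma \cdot, G) \in S_k$.
  $G' \vdash \delta \Rightarrow x_{k,l}$, by the induction hypothesis at
  $(k, G'.\mathit{start} \to \delta \cdot, G') \in S_l$.
  $(G'.\mathit{start} \to \delta) \in G'$, by the definition of Earley items.
  $G \vdash \alpha \mathbb{R} \Rightarrow x_{i,l}$, by L-Reflection.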

All remaining rules produce Earley items of the form
$(i, A \to \cdot\,\delta, G) \in S_i$.

  $G \vdash \epsilon \Rightarrow \epsilon$, by L-Empty.

Therefore, since the algorithm produced an Earley item of the form
$(0, G.\mathit{start} \to \delta \cdot, G)$ in the set $S_{|x|}$, we know that
$G \vdash \delta \Rightarrow x_{0,|x|}$. Because
$(G.\mathit{start} \to \delta) \in G$, we know that $x$ is in the language of
$G$.

□ 

Correctness (the algorithm is correct).

$G \vdash G.\mathit{start} \Rightarrow x$ iff
$(0, G.\mathit{start} \to \delta \cdot, G) \in S_{|x|}$

Proof. By the soundness and completeness lemmas above, the algorithm
recognizes a string iff it is valid. □

4 Complexity

We will characterize the complexity of this algorithm in terms of both the
length of the input string and the nature of the extended grammars it defines.
Let $n$ be the length of the input string, and let $g$ be the maximum size of
any extended grammar defined. We define the size of a grammar to be the sum of
the number of productions and the lengths of the right-hand sides. By this
definition, there are only $g$ distinct values of $A \to \alpha \cdot \beta$
possible in a grammar of size $g$.
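
The link between the size of a grammar and the number of cursor positions can be checked on a small example. In this sketch (the function names are ours), each production with a right-hand side of length $k$ contributes $k + 1$ cursor positions, which is exactly its contribution to the size:

```python
def grammar_size(productions):
    """Number of productions plus the total length of their right-hand sides."""
    return len(productions) + sum(len(rhs) for _, rhs in productions)

def dotted_rules(productions):
    """Every distinct value of A -> alpha . beta: one per cursor position."""
    return [(lhs, rhs, dot)
            for lhs, rhs in productions
            for dot in range(len(rhs) + 1)]

infix = [("Expr", ("SimpleExpr", "Op", "Expr")), ("Op", ("+",))]
size = grammar_size(infix)
positions = len(dotted_rules(infix))
```

For the two infix productions used in the examples, both `size` and `positions` come to 6.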

At each input position, there is some set of grammars which might be the
current grammar, given the part of the string to the left of the character. Let
$m$ be the maximum of the size of these sets, over the length of the string.
Having $m$ be greater than 1 occurs in cases where something else shares syntax
with a syntax extension construct, or when the extension is not terminated
unambiguously, both of which are undesirable in practice. However, in
pathological cases, $m$ grows exponentially with $n$. We know $m$ is always
finite because grammar extensions are applied in the order encountered and
⟨Gram⟩ is non-nullable, so every grammar is uniquely defined by a sequence of
distinct nonoverlapping nonempty substrings of the input string. It is possible
to limit the value of $m$ and abort parsing if it exceeds some preset value.

Before we proceed, we must specify the behavior of $\llbracket - \rrbracket$
and $\oplus$. We require that both of those take no more than $O(ngm)$ time.
Most natural definitions will satisfy this easily, as the string $x$ is no more
than $n$ characters long, and the grammars produced by $\oplus$ and
$\llbracket x \rrbracket$ have size no more than $g$.

Now we shall prove that recognition takes $O(n^3 g^3 m^3)$ time. Our argument
follows that of Earley [8].

First, we observe that the algorithm can be executed by first determining
the contents of $S_0$, then $S_1$, and so on, because the contents of each $S$
never depend on an $S$ further to the right. Furthermore, every rule that
places an Earley item into set $S_i$ has as an antecedent the existence of an
Earley item in $S_i$, with the exception of R-Start and R-Shift. Imagining for
the moment that each $S_i$ is a set that allows mutation by adding members, we
sketch out a strategy for taking the closure of our rules:

For each $S_i$, in order, "seed" the set by executing R-Start if $i = 0$, or
R-Shift on every appropriate item in $S_{i-1}$ otherwise. Now close the set
over the remaining rules: apply all rules to the new Earley items, the result
of which becomes the new Earley items for the next iteration, repeating until
no new items appear.

This closure process is the heart of the algorithm. For each Earley item
generated, it will execute the rules, and insert the resulting item (if any) into
the appropriate set. There is one set of Earley items for each input character,
so the asymptotic running time is

  number-of-input-characters × number-of-Earley-items-per-set ×
  (rule-execution-time + items-produced-per-item × set-insertion-time).

There are $n$ input characters. Each set contains at most $O(ngm)$ Earley
items: in the form $(i, A \to \alpha \cdot \beta, G_1)$, there are $n$
possible values of $i$, $g$ possible values for $A \to \alpha \cdot \beta$,
and the number of distinct grammars $G_1$ in the set is limited to $m$.

If each set is represented as an array of length $n$ containing linked lists of
items, and an item anchored at $i$ is stored in the list at index $i$ of the
array, there will be at most $O(gm)$ items in each linked list. To perform set
insertion by adding elements to these lists, we also need to compare Earley
items for equality quickly. It is possible to store all the components of our
Earley items as indices for constant-time comparison. This is trivial for the
anchor $i$ and for the rule position $A \to \alpha \cdot \beta$, but requires
explanation for the grammar $G$. The contents of grammars can be stored in a
table, and each Earley item's reference to the current grammar can be stored as
an index into that table. We have required that there only be one production of
the form ⟨Gram⟩ → $\gamma$, so for each grammar with location $(i, j)$ and
parent $G'$, there is only one possible Earley item that can produce it via
R-Refl-call. This means that newly created grammars are unequal to all existing
grammars, so the table never needs to be searched. Therefore, comparing Earley
items to each other takes constant time, and therefore inserting an Earley item
into the set $S_i$ takes $O(gm)$ time.
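
The set representation described above might look like the following sketch. It is only an illustration: the class name `ItemSet` is ours, Python lists stand in for the linked lists, and items are assumed to be tuples of integer indices (anchor, rule index, cursor position, grammar-table index), so that comparing two items is a fixed number of integer comparisons.

```python
class ItemSet:
    """One Earley set, bucketed by the anchor i of each item."""

    def __init__(self, n):
        # One bucket per possible anchor 0..n; each holds at most O(gm) items.
        self.buckets = [[] for _ in range(n + 1)]

    def insert(self, item):
        """Insert (anchor, rule, dot, grammar_index); return True if it was new."""
        bucket = self.buckets[item[0]]
        if item in bucket:        # scans only the bucket for this anchor
            return False
        bucket.append(item)
        return True

    def anchored_at(self, i):
        """All items with anchor i, as needed when R-Return knows the anchor."""
        return self.buckets[i]

s = ItemSet(3)
first = s.insert((0, 5, 1, 0))
again = s.insert((0, 5, 1, 0))    # duplicate, so it is not inserted twice
s.insert((2, 1, 0, 1))
```

Bucketing by anchor is what keeps insertion at $O(gm)$ rather than $O(ngm)$: only items sharing the new item's anchor are ever compared against it.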

Now, all that remains is to determine, per input item, how long the rules
take to execute, and how many items the rule produces. Each rule (other than
R-Start, which takes $O(g)$ time to execute overall) has at least one Earley
item as an antecedent. To apply the rule to an Earley item, we substitute the
item into the antecedent, and then test the remaining antecedents. This means
that rules with two Earley items as antecedents will be attempted twice and
succeed the second time.

R-Shift This rule takes $O(1)$ time to test the expected terminal against the
input string. It produces at most a single item.

R-Call This rule needs to walk $G$, so it takes $O(g)$ time, producing at most
$O(g)$ items.

R-Return We reproduce the rule below:

R-Return
$$\frac{(i, A \to \alpha \cdot B\beta, G) \in S_j \qquad (j, B \to \delta \cdot, G) \in S_k}{(i, A \to \alpha B \cdot \beta, G) \in S_k}$$

We will show that the rule takes $O(ngm)$ time and produces $O(ngm)$
items. It is always true that $j \le k$, because the end of a production must
not come before its start. There are two possible ways that an Earley item
could be relevant to this rule:¹

If we have the item $(j, B \to \delta \cdot, G) \in S_k$, we know what $j$ is
and that all matching items are in $S_j$. There are $O(ngm)$ items in $S_j$
which need to be checked to see if they match
$(i, A \to \alpha \cdot B\beta, G)$. All of them could match: this rule could
produce as many as $O(ngm)$ items.

But if we have the item $(i, A \to \alpha \cdot B\beta, G) \in S_j$, the only
matching Earley items that could have already been produced are those for which
$j = k$. So, we need to search $S_j$, which takes $O(gm)$ time to produce
$O(gm)$ items, because the anchor of the item we are looking for is known to be
$j$. The fact that $S_j$ is only partially complete at this point is of no
consequence; whichever item arrives last in $S_j$ will succeed in finding the
other.

¹Here, we differ from Earley by omitting a small optimization; he only tests
items for applicability as the $(j, B \to \delta \cdot, G)$ antecedent in the
R-Return rule. This always works when $j < k$, and sometimes works when
$j = k$. Additional work must be done to make this behave correctly in the
presence of nullable productions. Aycock [2] discusses three different
solutions to this problem.

²An anonymous reviewer points out that the value of $\delta$ is irrelevant in
executing this rule; therefore, an intermediate rule could collapse all items
of the form $(j, B \to \delta \cdot, G) \in S_k$ into a special item
$(j, B \to \square, G) \in S_k$, which the R-Return rule could look for
instead, reducing the number of times it executes. However, this would not have
an asymptotic effect on performance; the number of distinct possible values of
$B \to \square$, like the number of distinct possible values of
$B \to \delta \cdot$, is in $O(g)$.

R-Parse-grammar Like R-Call, this takes $O(g)$ time, producing at most
$O(g)$ items.

R-Refl-call Computing $G \oplus \llbracket x_{j,k} \rrbracket$ takes $O(ngm)$
time, as specified above. ⟨Gram⟩ is required to be non-nullable, so $j < k$,
and therefore the $(j, \langle\mathit{Gram}\rangle \to \gamma \cdot, G) \in S_k$
item always appears last. Searching $S_j$ for items matching
$(i, A \to \alpha \cdot \mathbb{R}\beta, G)$ takes $O(ngm)$ time and produces
at most $O(ngm)$ items.

R-Refl-return $G'.\mathit{location} = (j, k)$, and $G'.\mathit{parent} = G$.
Other than that extra bookkeeping, this rule proceeds like R-Return.

For each Earley item, executing the rules takes $O(ngm)$ time and produces
up to $O(ngm)$ items. Each item that is produced needs to be inserted into the
appropriate set (which, as we saw above, takes $O(gm)$ time). The deduplication
performed by set insertion ensures we only have to execute the rules once per
unique Earley item, even if the item is produced multiple times. Otherwise,
execution time would be slower, and it would even diverge in the case of
left-recursive rules.

Our total running time therefore is
$n \times O(ngm) \times (O(ngm) + O(ngm) \times O(gm)) = O(n^3 g^3 m^3)$. If
the rules R-Parse-grammar, R-Refl-call, and R-Refl-return are omitted, the
original Earley algorithm is recovered. The R-Return rule, which remains, can
still take $O(ngm)$ time and produce $O(ngm)$ items, so the complexity is the
same without the reflective rules. Since Earley supports a single grammar of
fixed size, $g$ and $m$ are constants. This is consistent with Earley's
$O(n^3)$ result. Our system is therefore "pay-as-you-go": its reflective
features have no asymptotic cost if they are not used.

Earley recognition provides further performance guarantees in cases where
the input obeys certain restrictions. We have not examined whether those same
guarantees apply to our work.
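
The total running time for recognition can be checked by expanding the product one factor at a time. This only restates the bounds derived in this section ($n$ sets, $O(ngm)$ items per set, $O(ngm)$ rule-execution time, $O(ngm)$ items produced per item, and $O(gm)$ per insertion):

$$
\begin{aligned}
T &= n \cdot O(ngm) \cdot \bigl(O(ngm) + O(ngm) \cdot O(gm)\bigr) \\
  &= n \cdot O(ngm) \cdot O(n g^2 m^2) \\
  &= O(n^3 g^3 m^3).
\end{aligned}
$$

When $g$ and $m$ are constants, as for a single fixed grammar, the last line reduces to $O(n^3)$.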
Buildtree complexity

The Buildtree algorithm of Scott [19], introduced in section 3, can be used to
construct parse trees (based on Earley items) when the results of ambiguous
parses are needed in a compact format. (An ambiguous grammar may parse a
sentence exponentially many or even infinitely many ways.)

Scott's complexity analysis asserts that Buildtree takes time proportional to

  number-of-input-characters × number-of-Earley-items-per-set ×
  predecessor-items-per-item

The number of predecessor items an Earley item may have, as in Scott's
work, is $O(n)$. To see this, observe that an item where the cursor follows a
nonterminal,

  $(i, A \to \alpha B \cdot \beta, G) \in S_j$

can have as predecessor any item of the form

  $(i, A \to \alpha \cdot B\beta, G) \in S_k$

where $i \le k \le j$. This same argument applies to cases where the cursor
follows an $\mathbb{R}$.

On the other hand, if the cursor follows a terminal, there is exactly one
predecessor, and items where the cursor is at the beginning of the right-hand
side have no predecessor.

The number of input characters is $n$. As above, the number of Earley items
in each of our sets is $O(ngm)$. So executing Buildtree requires
$O(n^3 g m)$ time. This means that Buildtree, which takes place only once
(after recognizing is completed), requires less time than recognizing, so it
does not affect the overall complexity.

5 Related work 

Parsers 

The idea of modifying an Earley parser to parse a more powerful class of
grammars was inspired by YAKKER [11], a powerful Earley-based parser for
dependent grammars. A dependent grammar can, for example, recognize the
language of strings containing a literal number n followed by a sequence of
precisely n characters.

Derivative-based parsing [17] is an approach to parsing context-free
languages in which the parse state at a given character is simply a grammar
representing the language of strings that are valid suffixes to the
already-parsed portion. The authors suggest that it could be used to implement
reflective grammars, but supply no details.

Like context-free grammars, parsing expression grammars (PEGs) can be
composed by combining productions to produce a legal grammar [10]. However,
the ordered choice provided by PEGs is not a true union, and "incorrect
orderings can cause subtle errors" [12]. For example, adding an if...then
construct can turn an existing if...then...else construct into a syntax error.
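
The ordering problem can be reproduced with a deliberately tiny model of ordered choice. The sketch below is not a PEG implementation; it assumes each alternative is a fixed string and that choice commits to the first alternative matching a prefix of the input, which is enough to show the effect. All names are ours.

```python
IF_THEN = "if c then s"
IF_THEN_ELSE = "if c then s else s"

def ordered_choice(alternatives, text):
    """Commit to the first alternative that matches a prefix; return its length."""
    for alt in alternatives:
        if text.startswith(alt):
            return len(alt)
    return None

def parses_fully(alternatives, text):
    """A sentence is accepted only if the chosen alternative consumes all of it."""
    return ordered_choice(alternatives, text) == len(text)

sentence = "if c then s else s"
longest_first = parses_fully([IF_THEN_ELSE, IF_THEN], sentence)
shortest_first = parses_fully([IF_THEN, IF_THEN_ELSE], sentence)
```

With the longer alternative listed first the sentence is accepted; with the shorter one first, the choice commits to the if...then prefix and the trailing `else s` is left unconsumed, so the same sentence is rejected. A true union of productions, as in a context-free grammar, does not depend on the order in which alternatives are listed.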

Language extension systems 

There are a variety of systems that tackle the issue of syntax extensibility. Each 
work in this category is a complete system that tackles both the issue of parsing 
and the issue of transformation. We will only cover the comparable portion 
here, the parsers. 

A few of these systems parse input using some kind of dynamic grammars 
which, like ours, support multiple grammars in one file. 

Kolbly [13] describes a syntax extension system with an Earley-based parser 
that can parse different regions of a file in different grammars. However, all 
grammar extensions must be predefined by the language designer — the user 
cannot extend the language. 

Another macro system with flexible syntax is ZL [1]. It allows new syntax
to be added to C, through a system of iterated re-parsing. However, it restricts
what syntactic forms the user may add.

Although Dylan's macro system [4] does not involve any special parser tech- 
nology, it does loosen Lisp's parentheses to a "syntactic skeleton" , giving macro 
authors more control over the appearance of macro invocations. 

Gel [9] is a language syntax that, by requiring adherence to whitespace
conventions, correctly parses code that looks like Java, CSS, Smalltalk, and
ANTLR. Their goal is in some ways a mirror image of ours: they unify a set
of existing syntaxes into one large syntax, while we describe how a single small
syntax can be extended into many others within the bounds of one file.

The Silver project [21] is a system for describing and extending languages,
and transforming those languages using attribute grammars. Schwerdfeger and
Van Wyk describe [18] a static analysis for language extensions which ensures
that, given a host language, any number of these extensions can be added to the
host language, and the result will be LALR(1), as their parser requires. However,
they must significantly restrict the permissible forms of syntax extensions in
order to do so.

Metafront [5] is a system for defining languages and transformations between
them. They describe a novel type of grammar called a "specificity grammar".
In such a grammar, more specific productions have priority over less specific
productions. Although composing their grammars can produce errors, these
errors can be expressed entirely in terms of the productions involved, rather
than as confusing shift/reduce and reduce/reduce conflicts. They also have
what they describe as a macro system; however, their macro definitions always
have the scope of an entire file, so they can use existing parser technology.

A system described by Cardelli, Matthes, and Abadi [6] discusses incre- 
mentally extending grammars by adding productions (and grammar restriction, 
where productions are removed). It rejects compositions of grammars that are 
not LL(1), but provides powerful integration between grammar definitions and 
transformations. 

Camlp4 [7] is a preprocessor for the OCaml language. It allows the user to
extend the OCaml syntax. It allows the language designer to select what parser
the resulting, extended, language will be parsed with, but the user must select
one language per file.

6 Conclusion and future work 

We have defined a class of grammars that specify languages that can modify
their own syntax during parsing. We have presented an algorithm that can
parse these reflective grammars and can parse nonreflective grammars as fast as
an ordinary Earley parser. Furthermore, we have placed bounds on how costly
the reflective feature is, in terms of how it is used.

We intend this work as the first step in building a macro system applicable 
to languages that lack parenthesis-based syntax. Our next steps will be to 
define requirements for a powerful and usable macro system, and describe how 
such a macro system would interact with this parser. In such a system, there 
would be no special syntax for macro invocation, so user-defined syntax would 
be indistinguishable from core syntax. With the dynamic power of our parser, 
it would be possible to have local definitions for macros, and even to import 
macros in a restricted scope. 